Abstract

We consider the Max K-Armed Bandit problem, where a learning agent is faced
with several sources (arms) of items (rewards), and is interested in finding the best
item overall. At each time step the agent chooses an arm and obtains a random
real-valued reward. The rewards of each arm are assumed to be i.i.d., with an
unknown probability distribution that generally differs among the arms. Under the
PAC framework, we provide lower bounds on the sample complexity of any $(\epsilon, \delta)$-correct
algorithm, and propose algorithms that attain this bound up to logarithmic
factors. We compare the performance of these multi-arm algorithms to the variant
in which the arms are not distinguishable by the agent and are chosen randomly
at each stage. Interestingly, when the maximal rewards of the arms happen to be
similar, the latter approach may provide better performance.


1 Introduction 

In the classic stochastic multi-armed bandit (MAB) problem the learning agent faces a set $K$ of
stochastic arms, and wishes to maximize its cumulative reward (in the regret formulation), or find
the arm with the highest mean reward (the pure exploration problem). This model has been studied
extensively in the statistical and learning literature, see for example |[lt] for a comprehensive survey.

We consider a variant of the MAB problem called the Max K-Armed Bandit problem (Max-Bandit
for short). In this variant, the objective is to obtain a sample with the highest possible reward
(namely, the highest value in the support of the probability distribution of any arm). More precisely,
considering the PAC setting, the objective is to return an $(\epsilon, \delta)$-correct sample, namely a sample
whose reward value is $\epsilon$-close to the overall best possible reward with a probability larger than
$1 - \delta$. In addition, we wish to minimize the sample complexity, namely the expected number of
samples observed by the learning algorithm before it terminates.

For the classical MAB problem, algorithms that find the best arm (in terms of its expected reward)
in the PAC sense were presented in ||2.[3l0]j and lower bounds on the sample complexity were presented
in 0 and ||3-. The essential difference with respect to this work is in the objective, which is
to find an $(\epsilon, \delta)$-correct sample in our case. The scenario considered in the Max-Bandit model is relevant
when a single best item needs to be selected from among several (large) clustered sets of items,
with each set represented as a single arm. These sets may represent parts that come from different
manufacturers or are produced by different processes, job candidates that are referred by different employment
agencies, populations in which the best match to certain genetic characteristics is sought,
or frequency bands from which the best channel is chosen in a cognitive radio wireless network.

The Max-Bandit problem was apparently first proposed in . For reward distribution functions
in a specific family, an algorithm with an upper bound on the sample complexity that increases as
~ was provided in n. For the case of discrete rewards, another algorithm was presented in
Is}] , without performance analysis. Later, a similar model in which the objective is to maximize the
expected value of the largest sampled reward for a given number of samples ($n$) was studied in .
In that work the attained best reward is compared with the expected reward obtained by an oracle
that samples the best arm $n$ times. An algorithm is suggested and shown to secure an upper bound
of order $n^{-\alpha}$ on that difference, where $\alpha < 1$ is determined by the properties of the distribution
functions and decreases as they are further away from a specific family of functions.

Our basic assumption in the present paper is that a known lower bound is available on the tail distributions,
namely on the probability that the reward of each given arm will be close to its maximum.
A special case is when the probability densities near the maximum are larger than a given value, 
but we consider more general function classes. Under that assumption, we provide an algorithm for 
which the sample complexity increases as at most ~ in(<t)^-in(e) ^ provides an improvement by 
a factor of over the result of (01, which was obtained for a more specific model. To compare 
with the result in we observe that with a choice of 5 = ^ in our algorithm, we obtain that 
the expected shortfall of the largest sample with respect to the maximal reward possible is at most 
of order ) (as compared to 0{n~°‘) with a < 1). Furthermore, we provide a lower bound 

on the sample complexity of every $(\epsilon, \delta)$-correct algorithm, which holds when several arms possess
maximal rewards that are close to that of the best arm. This lower bound is shown to coincide, up to
a logarithmic term, with the upper bound derived for the proposed algorithm.

A basic feature of the Max-Bandit problem (and the associated algorithms) is the goal of quickly
focusing on the best arm (in terms of maximal reward), and sampling from that arm as much as
possible. It is of interest to compare the obtained results with the alternative approach,
which ignores the distinction between arms, and simply draws a sample from a random arm (say,
with uniform probabilities) at each round. This can be interpreted as mixing the items associated
with each arm before sampling; we accordingly refer to this variant as the unified-arm problem.
This problem actually coincides with the so-called infinitely-many armed bandit model studied in
|[l^llll[T^[r^[l4l], for the specific case of deterministic arms studied in ([fSll. Which of the two
approaches (multi-arm or unified-arm) is preferable depends on the problem instance. However, as
a rule of thumb, when the maximal possible rewards of many arms are far from the optimal, the
multi-arm approach has better performance.

The paper proceeds as follows. In the next section we present our model. In Section 3 we provide
a lower bound on the sample complexity of every $(\epsilon, \delta)$-correct algorithm. In Section 4 we present
two $(\epsilon, \delta)$-correct algorithms, and we provide an upper bound on the sample complexity of one of
them. The first algorithm is simple and its bound has the same order as the lower bound up to a
logarithmic term in (where $|K|$ stands for the number of arms); the second algorithm is more

complicated and we believe that its bound is larger by up to a double logarithmic term in than
the lower bound. In Section 5 we consider for comparison the unified-arm case. In Section 6 we
close the paper with some concluding remarks. Certain proofs are deferred to the Appendix due to
space limitations.

2 Model Definition 

We consider a finite set of arms, denoted by $K$. At each stage $t = 1, 2, \ldots$ the learning agent chooses
an arm $k \in K$, and a real-valued reward is obtained from that arm. The rewards obtained from each
arm $k$ are independent and identically distributed, with a distribution function (CDF) $F_k(\mu)$, $k \in K$.

We denote the maximal possible reward of each arm by $\mu_k^* = \inf_{\mu \in \mathbb{R}}\{\mu \mid F_k(\mu) = 1\}$, assumed
finite, and the maximal reward among all arms by $\mu^* = \max_{k \in K} \mu_k^*$.

Throughout the paper, we shall make the following assumption. 

Assumption 1. There exist known constants $A > 0$, $\beta > 0$ and $\epsilon_0 > 0$ such that, for every $k \in K$
and $0 \le \epsilon \le \epsilon_0$, it holds that

$P(\mu_k > \mu_k^* - \epsilon) \ge A\epsilon^\beta$,

where $\mu_k$ stands for a random variable with distribution $F_k$.

The bound in the above assumption can also be expressed as $1 - F_k(\mu) \ge A(\mu_k^* - \mu)^\beta$ for
$\mu \in [\mu_k^* - \epsilon_0, \mu_k^*]$. This condition requires $\mu_k$ to have a certain mass near its maximal reward.
Note that the specific case of $\beta = 1$ is satisfied if the densities are lower bounded by a constant $A$.
Values of $\beta < 1$ accommodate leaner tails.

The upper bound on the CDF Fk ensures that for each arm, an e-optimal reward can be observed by 
a finite number of samples. The bound in the above assumption is similar to those assumed in ifl^ 
and iTsIl . 
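
As a rough numerical illustration of Assumption 1, the sketch below draws rewards from a hypothetical family with maximal reward `mu_max` whose shortfall `mu_max - X` is distributed as `U**(1/beta)`, with `U` uniform on [0, 1]. For this family $P(X > \mu_{max} - \epsilon) = \epsilon^\beta$, so the assumption holds with $A = 1$ and $\epsilon_0 = 1$. The family, the function names and the parameter values are assumptions chosen for illustration only; they are not part of the model above.

```python
import random


def sample_reward(mu_max, beta, rng):
    """Draw one reward X whose shortfall mu_max - X equals U**(1/beta)."""
    u = rng.random()
    return mu_max - u ** (1.0 / beta)


def empirical_tail(mu_max, beta, eps, n, rng):
    """Estimate P(X > mu_max - eps) from n independent draws."""
    hits = sum(
        1 for _ in range(n) if sample_reward(mu_max, beta, rng) > mu_max - eps
    )
    return hits / n


if __name__ == "__main__":
    rng = random.Random(0)
    A, beta, mu_max = 1.0, 0.5, 0.9  # illustrative values
    for eps in (0.01, 0.05, 0.2):
        est = empirical_tail(mu_max, beta, eps, 50_000, rng)
        # The empirical tail should be close to the bound A * eps**beta.
        print(f"eps={eps}: empirical {est:.4f}, A*eps^beta {A * eps ** beta:.4f}")
```

For this particular family the bound of Assumption 1 is attained with equality, so the two printed columns should roughly agree up to sampling noise.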

An algorithm for the Max-Bandit model samples an arm at each time step, based on the observed 
history so far (i.e., the previously selected arms and observed rewards). We require the algorithm 
to terminate after a random number T of samples, which is finite with probability 1, and return a 
reward V which is the maximal reward observed over the entire period. An algorithm is said to be 
$(\epsilon, \delta)$-correct if

$P(V > \mu^* - \epsilon) \ge 1 - \delta$.

The expected number of samples E[T] taken by the algorithm is the sample complexity, which we 
wish to minimize. 
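
The correctness requirement can be probed by simulation. The sketch below is a toy, single-arm illustration and not one of the algorithms of this paper: a hypothetical fixed-budget procedure draws `n_samples` rewards uniform on [0, 1] (so $\mu^* = 1$) and returns the largest one, and the fraction of runs with $V > \mu^* - \epsilon$ is estimated by Monte Carlo. For this toy case the exact success probability is $1 - (1 - \epsilon)^n$, which gives something to compare against. All names here are assumptions made for the example.

```python
import random


def fixed_budget_max(n_samples, rng):
    """Toy procedure: draw n_samples rewards uniform on [0, 1], return the largest."""
    return max(rng.random() for _ in range(n_samples))


def estimate_success(n_samples, mu_star, eps, runs, rng):
    """Fraction of independent runs whose returned reward V exceeds mu_star - eps."""
    wins = sum(
        1 for _ in range(runs) if fixed_budget_max(n_samples, rng) > mu_star - eps
    )
    return wins / runs


if __name__ == "__main__":
    rng = random.Random(0)
    n, eps = 60, 0.05
    est = estimate_success(n, 1.0, eps, 5_000, rng)
    exact = 1.0 - (1.0 - eps) ** n
    print(f"estimated success {est:.3f}, exact {exact:.3f}")
```

Here the number of samples $T$ is deterministic, so the sample complexity $E[T]$ is simply `n_samples`; the procedure is $(\epsilon, \delta)$-correct for any $\delta$ at least $(1 - \epsilon)^n$.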


3 A Lower Bound 


Before turning to our proposed algorithm, we provide a lower bound on the sample complexity of
any $(\epsilon, \delta)$-correct algorithm. The bound holds under Assumption 1 when $\beta \le 1$. The case of $\beta > 1$
is more complicated to analyze, and it is still unclear whether our lower bound holds in this case.


The following result specifies the lower bound of this section. 


Theorem 1. Suppose Cq ^ /3 < 1 and let e G (0, Eq) tind S € (0,1). Let k* denote some 

optimal arm, such that = p,*. Then, under Assumption\J] for every (e, 5)-correct algorithm, it 
holds that 


E[T] > y - 

k^K\{k*}8A {mm{eo,e + p* 



( 1 ) 


This lower bound can be interpreted as summing over the minimal number of times that each arm,
other than the optimal arm $k^*$, needs to be sampled. It is important to observe that if there are several
optimal arms, only one of them is excluded from the summation. Indeed, the bound is most effective
when there are several optimal (or near-optimal) arms, as the denominator of the summand is smaller,
and hence the summand is larger, for such arms. This may appear surprising at first, as more sources
of good rewards are available; however, when there is a single arm that is strictly better than the
others it can be quickly singled out, while if many arms have nearly optimal rewards, more samples
are "wasted" on determining which arm is best.
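
To make this interpretation concrete, the following sketch evaluates only the arm-dependent part of the bound, assuming each summand has the form $1 / (8A(\min\{\epsilon_0, \epsilon + \mu^* - \mu_k^*\})^\beta)$. The logarithmic factor in $\delta$ is deliberately left out, so the numbers are relative magnitudes rather than values of the bound itself. The function name and the two test instances are hypothetical.

```python
def lower_bound_arm_factor(mu_max, A, beta, eps, eps0):
    """Sum over all arms except one optimal arm of 1 / (8 A min(eps0, eps + gap)^beta)."""
    mu_star = max(mu_max)
    k_star = mu_max.index(mu_star)  # exclude exactly one optimal arm
    total = 0.0
    for k, mu_k in enumerate(mu_max):
        if k == k_star:
            continue
        width = min(eps0, eps + (mu_star - mu_k))
        total += 1.0 / (8.0 * A * width ** beta)
    return total


if __name__ == "__main__":
    single_best = [0.9] + [0.1] * 9   # one good arm, nine poor arms
    all_optimal = [0.9] * 10          # ten equally good arms
    for name, arms in (("single best", single_best), ("all optimal", all_optimal)):
        value = lower_bound_arm_factor(arms, A=1.0, beta=1.0, eps=0.01, eps0=0.5)
        print(f"{name}: {value:.2f}")
```

With these illustrative values the second instance gives a much larger sum than the first, in line with the remark that near-optimal arms are the costly ones.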

The proof of Theorem 1 is provided in Appendix A and proceeds by showing that if an algorithm
is $(\epsilon, \delta)$-correct and its sample complexity is lower than a certain threshold for some set of reward
distributions, then this algorithm cannot be $(\epsilon, \delta)$-correct for some related reward distributions.


4 Algorithms 

Here we provide two $(\epsilon, \delta)$-correct algorithms. The first algorithm is based on sampling, at each
time step, the arm which has the highest upper confidence bound on its maximal reward, and the
second algorithm is based on arm elimination.

4.1 Maximal Confidence Bound 

The algorithm starts by sampling a certain number of times from each arm. Then, it repeatedly 
calculates an index for each arm which can be interpreted as a certain upper bound on the maximal 
reward of this arm, and samples once from the arm with the largest index. The algorithm terminates 
when the number of samples from the arm with the largest index is above a certain threshold. This
idea is similar to that in the UCB1 algorithm provided in lH^ .
Algorithm 1 Maximal Confidence Bound (Max-CB) Algorithm
1: Input: Model parameters $\epsilon_0 > 0$, $A > 0$ and $\beta > 0$, constants $\delta > 0$ and $\epsilon > 0$.

Define L = 61n (^\K\ ^^ ■ 

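2: Initialization: Counters $C(k) = N_0$, $k \in K$, where $N_0 = \lfloor \frac{L - \ln(\delta)}{A\epsilon_0^\beta} \rfloor + 1$.

3: Sample $N_0$ times from each arm.

4: Compute $Y^k_{C(k)} = V^k_{C(k)} + \epsilon^{UB}(C(k))$ and set $k^* \in \arg\max_{k \in K} Y^k_{C(k)}$ (with ties broken arbitrarily), where $V^k_{C(k)}$ is the largest reward observed so far from arm $k$ and $\epsilon^{UB}(N) = \left(\frac{L - \ln(\delta)}{AN}\right)^{1/\beta}$.

5: If $\epsilon^{UB}(C(k^*)) < \epsilon$, stop and return the largest sampled reward.

Else, sample once from arm $k^*$, set $C(k^*) = C(k^*) + 1$ and return to step 4.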
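
The steps of the Max-CB algorithm can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the confidence width is taken to be $\epsilon^{UB}(N) = ((L - \ln\delta)/(AN))^{1/\beta}$, the constant $L$ is passed in as a plain parameter rather than computed from its definition, and the arms are hypothetical uniform distributions on $[0, \mu_k^*]$ with $\mu_k^* \le 1$, which satisfy Assumption 1 with $A = 1$, $\beta = 1$ and $\epsilon_0$ equal to the smallest $\mu_k^*$.

```python
import math
import random


def eps_ub(n, L, delta, A, beta):
    """Confidence width after n samples: ((L - ln(delta)) / (A n))^(1/beta)."""
    return ((L - math.log(delta)) / (A * n)) ** (1.0 / beta)


def max_cb(arms, L, delta, eps, eps0, A, beta, rng):
    """Run the Max-CB loop; arms is a list of callables rng -> reward.

    Returns (largest sampled reward, total number of samples).
    """
    n0 = math.floor((L - math.log(delta)) / (A * eps0 ** beta)) + 1
    best_seen = [max(arm(rng) for _ in range(n0)) for arm in arms]
    counts = [n0] * len(arms)
    while True:
        # Index of each arm: largest observed reward plus confidence width.
        index = [v + eps_ub(c, L, delta, A, beta) for v, c in zip(best_seen, counts)]
        leader = max(range(len(arms)), key=index.__getitem__)
        if eps_ub(counts[leader], L, delta, A, beta) < eps:
            return max(best_seen), sum(counts)
        best_seen[leader] = max(best_seen[leader], arms[leader](rng))
        counts[leader] += 1


def uniform_arm(mu_max):
    """Hypothetical arm: rewards uniform on [0, mu_max]."""
    return lambda rng: mu_max * rng.random()


if __name__ == "__main__":
    arms = [uniform_arm(m) for m in (0.9, 0.7, 0.5)]
    # L = 10.0 is an arbitrary illustrative value, not the paper's definition.
    value, total = max_cb(arms, L=10.0, delta=0.1, eps=0.05, eps0=0.5,
                          A=1.0, beta=1.0, rng=random.Random(0))
    print(f"returned reward {value:.3f} after {total} samples")
```

Since an arm is sampled only while its confidence width is at least $\epsilon$, no counter can exceed $\lfloor (L - \ln\delta)/(A\epsilon^\beta) \rfloor + 1$, so the loop always terminates.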


Theorem 2. Under Assumption 1, for $L > 10$, Algorithm 1 is $(\epsilon, \delta)$-correct with a sample complexity of

$E[T] \le \sum_{k \in K} \frac{L - \ln(\delta)}{A(\max\{\epsilon, \mu^* - \mu_k^*\})^\beta} + |K| N_0$,

where $N_0$ and $L$ are as defined in the algorithm.


In the following corollary we present the ratio between the lower bound presented in Theorem 1 and
the upper bound in Theorem 2.

Corollary 1. If there is more than one arm for which $\mu_k^* \in [\mu^* - \epsilon, \mu^*]$, then the upper bound on
the sample complexity is of the same order as the lower bound in Theorem 1 up to a logarithmic

f , - 1*^1 

factor in 

Proof. For every k £ K it follows that 




1-f 2^ 


> 


2^ 


1 

> 


1 


(mm(eo,e-l-p* -pDf (e-h p* - p^f eg {max{e, p* - pDf 


^ A o2 
S — 


and for every two arms k' and k* for which £ [p* — e, p*] and = p* it is obtained that 

01 , > 2 -/ 501 . _ ( 2 ) 

In addition, the lower bound is of the same order as 

(3) 


-ln(5) ^ 0 

k^K\{k*} 


1 

k ’ 


the upper bound is of the same order as 

(L - ln(<5)) ^ Ql , 

kGK 


Therefore, the upper bound in Theorem^is of the same order of the lower bound in Theorem\J]up 
to an order of ’ whic/i is logarithmic in 


□ 


To establish Theorem 2, we first bound the probability of the event under which the upper bound
of the best arm is below the maximal reward. Then, we bound the largest number of samples after
which the algorithm terminates under the assumption that the upper bound of the best arm is above
the maximal reward.
Proof (Theorem 2). We denote the time step of the algorithm by $t$, and the value of the counter $C(k)$
at time step $t$ by $C^t(k)$. Recall that $T$ stands for the random final time step. By the condition in step
5 of the algorithm, for every arm $k \in K$, it follows that

$C^T(k) \le \lfloor \frac{L - \ln(\delta)}{A\epsilon^\beta} \rfloor + 1$. (4)

Note that by the fact that for X > 6 it follows that < 1, and by the fact that for xq = exp(l|) 

it follows that xq > 61n(a:o) = 10 it is obtained that 

for L > 10. So, by the fact that T = /or L > 10 it follows that 


T < \K\ 


/L-\n{S) 
V AeP 



<\K\ 


/L' -\n{6) 



< L'2 


L 

= . 


(5) 


Now, we begin with proving the $(\epsilon, \delta)$-correctness property of the algorithm. Recall that for every
arm $k \in K$ the rewards are distributed according to the CDF $F_k(\mu)$. Assume w.l.o.g. that
$\mu_1^* = \mu^*$. Then, for $N \ge N_0$ and by the fact that $(1 - \epsilon)^{1/\epsilon} \le e^{-1}$ for every $\epsilon \in (0, 1]$, for
$\epsilon^{UB}(N) = \left(\frac{L - \ln(\delta)}{AN}\right)^{1/\beta}$ it follows that

$P\left(V_N^1 < \mu^* - \epsilon^{UB}(N)\right) = \left(F_1(\mu^* - \epsilon^{UB}(N))\right)^N \le \left(1 - A(\epsilon^{UB}(N))^\beta\right)^N \le \delta e^{-L}$, (6)

where $V_N^k$ is the largest reward observed from arm $k \in K$ after this arm has been sampled
$N$ times. Hence, at every time step $t$, by the definition of $Y^k_{C^t(k)}$ and Equations (5) and (6), by applying
the union bound, it follows that


exp(i) 

(7) 

Since by the condition in step\^ it is obtained that when the algorithm stops 






e, 


and by the fact that for every time step 


it follows by Equation dTJl that 


Pc*{k*) ^ 


P 


b* ~ ^ ^ P (^c*(l) - k*') - ^ ■ 


Therefore, it follows that the algorithm returns a reward greater than $\mu^* - \epsilon$ with a probability
larger than $1 - \delta$. So, it is $(\epsilon, \delta)$-correct.


For proving the bound on the expected sample complexity of the algorithm we define the following
sets:

$M(\epsilon) = \{l \in K \mid \mu^* - \mu_l^* \le \epsilon\}$, $N(\epsilon) = \{l \in K \mid \mu^* - \mu_l^* > \epsilon\}$.

Ai before, we assume w.l.o.g. that p\ = p*. For the case in which 

n {YcHD^k}, 

l<t<T 

occurs, since < p’^for every k G K, and every time step, it follows that the necessary 

condition for sampling from arm k, 

yrk ^ T^l 
occurs only when the event


occurs. But 


Therefore, it is obtained that 


c^(fc)<L + 


( 8 ) 


By using the bound in Equation (4) for the arms in the set $M(\epsilon)$, the bound in Equation (8) for the
arms in the set $N(\epsilon)$ and the bound in Equation (5), it is obtained that

$E[T] \le (1 - P(E_1))\, e^{L/3} + P(E_1)\, \Phi(\epsilon)$, (9)


where 


*(«)= E U 


L - 1ti(«) 


kGN{t) V ^ (d* Bk)' 


r-l + M ^ (L 


keM(e) 


L - ln(,5) 
~A^ 


J + 1 


In addition, by Equation O, the bound in Equation © and by applying the union bound, it follows 
that 

T 

P{Ei) > 1 ^ 

So, 

l-P{Ei) < 6e-^. 

Furthermore, by the definitions of the sets N{e) and M{e), it can be obtained that 

. L — ln(<5) 

$ (e) E ^ [-o J + 1- 

Therefore, by Equations (9), (10) and (11), the bound on the sample complexity is obtained.


( 10 ) 


( 11 ) 


□ 


4.2 Maximal Eliminator 

The algorithm starts by sampling a certain number of times from each arm. Then, it repeatedly
calculates an index for each arm which can be interpreted as a certain upper bound on the maximal
reward of this arm, and eliminates arms for which that index is below the maximal sampled reward
so far. Then it samples only from the retained arms (those arms which have not been eliminated) a
number of times that is doubled at each sampling phase. This idea is similar to that in the Median
Elimination Algorithm provided in [|2l .

We do not provide performance analysis for Algorithm 2. However, since the number of times at
which the confidence bounds should be correct (times at which the algorithm eliminates arms) is
only logarithmic in the number of total samples, we have = ln(2L) (where L is defined in 

Algorithm [Hand the factor 2 arises because of the doubling). Therefore, we believe that the upper 
bound on the sample complexity of Algorithmic would be that of Algorithm[T]multiplied by Ak .— 
So, the upper bound would be of the same order of the lower bound in Theorem [T] up to double 
logarithmic terms. 

5 Comparison with The Unified-Arm Model 

In this section, we analyze the improvement in the sample complexity obtained by utilizing the 
multi arm property (the ability to choose from which arm to sample at each time step) compared 
Algorithm 2 Maximal Eliminator (ME) Algorithm
1: Input: Model parameters $\epsilon_0 > 0$, $A > 0$ and $\beta > 0$, constants $\delta > 0$ and $\epsilon > 0$.

Define = In (l2 In [\K\ (l + ^:^)) ) • 

2: Initialization: A set of arms Kt=i = K and counter t = 1. 

3: Sample Nt times from each arm in the set Kt, where Nt = 2*“^ ^ [-—J 
4: Compute + €^^{Nt+i — Nq), 



where is the largest reward observed from arm k and e 


UB 


(N) = 


AN 




1//3 


5: If e^^{Nt+i — Nq) < e, stop and return the largest sampled reward. 

Else, set t = t + 1, Kt = {k € Kt-i\Y^ > maxj^Kt-i return to step[3] 


Algorithm 3 Unified-Arm Algorithm
1: Input: Constants $\delta > 0$, $\epsilon > 0$.

2: Sample $\lceil \frac{-\ln(\delta)|K|}{A\epsilon^\beta} \rceil + 1$ times from the unified arm.

3: Return the best sampled reward.


to a model in which all the arms are unified into a single arm, so that the sample is effectively
obtained from a random arm. In the unified-arm model, when the agent samples from this unified
arm, one of the original arms is chosen uniformly at random and a reward is sampled from this
arm. We denote the CDF of the unified arm by $F(\mu)$, with $F(\mu) = \frac{1}{|K|}\sum_{k \in K} F_k(\mu)$. By Assumption 1,
$1 - F(\mu) \ge \frac{A}{|K|}(\mu^* - \mu)^\beta$ for $\mu \in [\mu^* - \epsilon_0, \mu^*]$, and the corresponding maximal reward is $\mu^*$.

In the remainder of this section, we provide a lower bound on the sample complexity and an $(\epsilon, \delta)$-correct
algorithm that attains the same order of this bound for the unified-arm model. (Note that the
lower bound in Theorem 1 is meaningless for $|K| = 1$.) Then, we discuss which approach (multi-arm
or unified-arm) is better for different model parameters, and provide examples that illustrate
these cases.


5.1 Lower Bound 


The following theorem provides a lower bound on the sample complexity for the unified-arm model.

Theorem 3. Suppose eo < (’^) ^ ^ ^ S G (0, 1). Then, under Assumption 

\I\for every (e, 5)-correct algorithm, it holds that 


E[T] > 


\K\ 

AAeP 


In 



( 12 ) 


The proof is provided in Appendix B and is based on an idea similar to that of the proof of Theorem 1.


5.2 Algorithm 


In Algorithm 3, a certain number of rewards is sampled, and the algorithm chooses the best one
among them. In the following theorem we provide a bound on the sample complexity achieved by
Algorithm 3.

Theorem 4. Under Assumption 1, Algorithm 3 is $(\epsilon, \delta)$-correct, with a sample complexity bound of

$E[T] \le \frac{|K| \ln(1/\delta)}{A\epsilon^\beta} + 2$.

The proof is provided in Appendix C. Note that the upper bound on the sample complexity is of the
same order as the lower bound in Theorem 3.
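
The unified-arm procedure is simple enough to sketch directly. The code below assumes a sample size of the form $\lceil -\ln(\delta)|K| / (A\epsilon^\beta) \rceil + 1$, which is consistent with the bound of Theorem 4, and models the unified arm by picking one of the original arms uniformly at random before each draw. The uniform arms and all function names are hypothetical choices made for this illustration.

```python
import math
import random


def unified_sample_size(num_arms, A, beta, eps, delta):
    """Number of draws: ceil(-ln(delta) |K| / (A eps^beta)) + 1."""
    return math.ceil(-math.log(delta) * num_arms / (A * eps ** beta)) + 1


def unified_arm_algorithm(arms, A, beta, eps, delta, rng):
    """Draw from a uniformly chosen arm each round; return (largest reward, draws)."""
    n = unified_sample_size(len(arms), A, beta, eps, delta)
    best = max(rng.choice(arms)(rng) for _ in range(n))
    return best, n


def uniform_arm(mu_max):
    """Hypothetical arm: rewards uniform on [0, mu_max]."""
    return lambda rng: mu_max * rng.random()


if __name__ == "__main__":
    arms = [uniform_arm(m) for m in (0.9, 0.7, 0.5)]
    best, n = unified_arm_algorithm(arms, A=1.0, beta=1.0, eps=0.05,
                                    delta=0.1, rng=random.Random(0))
    print(f"largest reward {best:.3f} after {n} draws")
```

Unlike the multi-arm algorithms, the number of draws here is fixed in advance and does not depend on the observed rewards, which is why the sample complexity bound follows immediately from the sample size.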
5.3 Comparison and Examples

To find when the multi-arm algorithm is helpful, we can compare the upper bound on the sample
complexity provided in Theorem 2 for Algorithm 1 (multi-arm case) with the lower bound for the
unified-arm model in Theorem 3.

Case 1: Suppose first that arm 1 is best: $\mu_1^* = \mu^*$, while all the other arms fall short significantly

compared to the required accuracy $\epsilon$: $\mu_k^* \ll \mu^* - \epsilon$, for $k \ne 1$.

In this case i ^ - - - — - -for k ^ 1. Hence the upper bound on sample complexity of 

jj 

Algorithm 1 (multi-arm case) will be smaller than the lower bound for the unified-arm model in
Theorem 3. We now provide an example which illustrates Case 1 numerically.

Example 1 (Case 1). Let \K\ = 10^ pj = 0.9, ^ 1 % = 0.1 Vfc e AT \ {1}, P = landA = 0.01. For 
e = 10““^ and 6 = 10“^ the sample complexity attained by Algorithm\I\is 3.52 x 10®. The lower 
bound for the unified-arm model is 1.59 x 10^®. The sample complexity attained by Algorithm\^(for 
the unified-arm model model) is 6.9 x 10^*^. 

Case 2: Consider next the opposite case, where there are many optimal arms and few that are worse:
say $\mu_1^* \ll \mu^* - \epsilon$, while $\mu_k^* = \mu^*$ for all $k \ne 1$.

In this case for k 1. Hence, since there is a logarithmic-in-— multi- 

j) 

plicative factor in the upper bound on the sample complexity of Algorithm[T] (multi-arm case), this 
bound will be larger than the lower bound for the unified-arm model in Theorem 3. The following
example illustrates Case 2 numerically.

Example 2 (Case 2). Let \K\, A, fi, 5 and e remain the same as in Example\J] and let p\ = 0.1 and 
pI. = 0.9 \/k G K \ {!}. The sample complexity of Algorithm\I\is 1.56 x 10^^, which is larger than 
the sample complexity of Algorithmf^which is 6.9 x 10^*^. 

As shown in Example 2, in some cases the bound on the sample complexity of Algorithm 1 (multi-arm)
is larger than that of Algorithm 3 (unified-arm). By comparing the upper bounds of these
algorithms, we believe that the logarithmic factor in the bound of Algorithm 1 may not be
required.

As observed by comparing the lower and upper bounds for the multi-arm and the unified-arm models,
the unified-arm algorithm provides a tighter upper bound (compared to the matching lower bound).
Therefore, when the benefit obtained by the multi-arm model is small (i.e., when there are many
good arms), applying the multi-arm algorithm may result in a loss rather than a gain.

6 Conclusion 

In this paper we have developed corresponding lower and upper bounds on the sample complexity
for the Max K-Armed Bandit problem, which are essentially of the same order up to a logarithmic
term.

These results were compared to the unified-arm model, where the learning algorithm effectively
unifies the different arms into one. While the multi-arm algorithm usually performs better, in some
cases, in particular when most arms are optimal, the unified-arm algorithm may provide better
performance. It remains to be shown whether an algorithm that provides the performance benefits
of both approaches can be devised.

Another direction for future work concerns the relaxation or generalization of our Assumption 1,
which requires a known lower bound on the tail distribution of the rewards.
7 Appendix A


Proof (TheoremlTJ. Let < 1 — Ae^} for every k G K. Then, we define the 

following set of hypotheses {Hq, Hi, 

Ho-- /f"(M)=/fc(M) ykGK, 

and, for every k = 1,..., \K\, 


Hk : 




ife-o < (m* 

iff-o > {k* 


kk + e)-' 

kl + e) -■ 


/fH^) =7fc/fe(Ai)l(-oo,7i,)(Ai) + lkfk{k)Hk = kk) + /fe(M)l(7i,,Mj](M) 
+ AP{p*+e-p)^ ^ l(^*,^*+e](M) 


where fkik) ike probability density function of arm k G K, \q stand for the indicator function of 

the set 0, 7 ^ = 1 — Ae^, 7 ^ = 1 — chosen such that f^’’ {k)dk = 1 - 

Note that since for every xi,X 2 > 0 it follows that x^+x^^ {xi + xf}^ for /3 < 1, Assumption\I\ 
holds for hypotheses {Hi,..., H\x\\- 

To further bound yf. and 7 ^, note that since ep < (4A) , 

1 — 2 A {p* — /ifc + e)^ < 7 fe < 1. 

Let Pk stands for the mass of an atom in the probability function of arm k G K at the point -Jlf. (if 
there is one), then we note that 

/ OO 

fk''{k)dk = lk {Fkikk)-Pk)+llPk + l-Fk{kk) + A{p* - pl+e)^ = 4>(7fc,7fe), 

-00 

but, since Fkljlk) > 1 - Ae^, for 7 ^ = 7 ^ it follows that ^ 1 - since $( 7 ^, 7 ^) 

increases in 7 ^ it is obtained that 7 ^ > 7 ^. Finally, it follows that in the case of Cq < {p* — p^. + e), 

Ik <ll, 

and in the case of cq > {p* — /ij + e), 

7fc < kmn[yl,yl) , 

where 

7 fc = 1 - 2 ^ (min (ep, p* - pi + e))^ 

If hypothesis Hk (k Oj /s true, then pi > pi + e for all I f k, hence the algorithm should 
provide a reward from arm k with probability larger than 1 — (5. We use Ejf and Pjf to denote the 
expectation and probability, respectively, under the algorithm being considered and hypothesis Hk. 
Further, for every k G K let 

4(1 - 7 fc) (lej) ’ 

and let Tk stands for the number of samples from arm k. 

Suppose now that our algorithm is $(\epsilon, \delta)$-correct under $H_0$, and that $E_0[T_k] \le t_k$ for some $k \in K$.
We will show that this algorithm cannot be $(\epsilon, \delta)$-correct under hypothesis $H_k$. Therefore, an $(\epsilon, \delta)$-correct
algorithm must have $E_0[T_k] > t_k$ for all $k \in K$.

Define the following events:

• $A_k = \{T_k \le 4t_k\}$. It easily follows from $4t_k(1 - P_0(A_k)) \le E_0[T_k]$ that if $E_0[T_k] \le t_k$, then $P_0(A_k) \ge \frac{3}{4}$.

• Let $B_k$ stand for the event under which the chosen arm at termination is $k$, and $B_k^c$ for its complement. Since $P_0(B_k) > \frac{1}{2}$ can hold for one arm at most, it follows that $P_0(B_k^c) \ge \frac{1}{2}$ for every $k \in K \setminus \{k'\}$ for some $k'$.

• Let $C_k$ be the event under which all the samples obtained from arm $k$ are on the interval $(-\infty, \mu_k^*]$. Clearly, $P_0(C_k) = 1$.


Define now the intersection event $S_k = A_k \cap B_k^c \cap C_k$. We have just shown that for every
$k \in K \setminus \{k'\}$ it holds that $P_0(A_k) \ge \frac{3}{4}$, $P_0(B_k^c) \ge \frac{1}{2}$ and $P_0(C_k) = 1$, from which it follows that

$P_0(S_k) \ge \frac{1}{4}$ for $k \ne k'$. Further, observe that for every history of $N$ samples for which the

event Ck holds, it holds that (^Af) > We therefore obtain the following inequalities, 

at ^n 


Pk (Sf) > Pf {Sk) = 


H 


fjpH 


^ -4tfc ^ -L 

>iA ‘>j 


.-In 


165 > 


>lk*'^Po^{I{Sk)) 


1 - 

where in the last inequality we used the facts that (1 — e)' > e~^. 

We found that if an algorithm is $(\epsilon, \delta)$-correct under hypothesis $H_0$ and $E_0[T_k] \le t_k$ for some
$k \ne k'$, then, under hypothesis $H_k$ this algorithm returns a sample that is smaller by at least $\epsilon$ than
the maximal possible reward with probability of $\delta$ or more, hence the algorithm is not $(\epsilon, \delta)$-correct.
Therefore, any $(\epsilon, \delta)$-correct algorithm must satisfy $E_0[T_k] > t_k$ for all arms except possibly for
one (namely, for the one $k'$ for which $P_0(B_{k'}^c) < \frac{1}{2}$). In addition $t_{k^*} \ge t_{k'}$, where $k^*$ is the optimal
arm (namely, $\mu_{k^*}^* = \mu^*$). Hence the lower bound is obtained.

□ 


8 Appendix B 

Proof (Theorem|3ll. First, we define the following hypotheses: 

Ho-- /"“(m) = /(m), 


and 


Pi 


p{p) =^f{p)Jr^^l3{p* + e-p)'^ 


where, as in the proof of Theorem\I] f{p) is the probability density function of the unified arm, 1^ 
stand for the indicator function of the set A, and 7 is chosen such that f^^ {f)^^ = 1- 

Note that since for every xi,X 2 > 0 it follows that a;f + X 2 > {xi + X 2 )^ for /3 < 1, Assumption\J\ 
holds for hypothesis Hi. 


To further bound 7, note that 


Therefore, 


1 = 


/ OO 

p-{p)dp = y + 

-00 


AeP 

WV 


^ AeP 
\K\- 

If hypothesis Hi is true, the algorithm should provide a reward greater than p*. We use Ei and Pi 
(where I £ { 0 , l}j to denote the expectation and probability respectively, under the algorithm being 
considered and under hypothesis Hi. Now, let 


t = 


1 


4 ( 1 - 7 ) 


In ( 

55 


and recall that T stands for the total number of samples from the arm. 
Now, we assume we run an algorithm which is $(\epsilon, \delta)$-correct under $H_0$ and that $E_0[T] \le t$ for this
algorithm. We will show that this algorithm cannot be $(\epsilon, \delta)$-correct under hypothesis $H_1$. Therefore,
an $(\epsilon, \delta)$-correct algorithm must have $E_0[T] > t$.

Define the following events:


• $A = \{T \le 4t\}$. By the same consideration as in the proof of Theorem 1 (for the events $\{A_k\}_{k \in K}$), it follows that if $E_0[T] \le t$, then $P_0(A) \ge \frac{3}{4}$.

• Let $B$ stand for the event under which the chosen sample is smaller than or equal to $\mu^*$, and $B^c$ for its complement. Clearly, $P_0(B) = 1$.

• We define the event $C$ to be the event under which all the samples obtained from the unified arm are on the interval $(-\infty, \mu^*]$. Clearly, $P_0(C) = 1$.


Define now the intersection event $S = A \cap B \cap C$. We have shown that $P_0(A) \ge \frac{3}{4}$, $P_0(B) = 1$
and $P_0(C) = 1$, from which it is obtained that $P_0(S) \ge \frac{3}{4}$. In addition, since for every history
$h_N$ of $N$ samples, for which the event $C$ holds, it is obtained that $\frac{dP_1}{dP_0}(h_N) \ge \gamma^N$, we have the
following.


Pi {B) > Pi (S) = Eo 


dPi 


KS) 


>7-"‘Po(/(^)) 


> ^^-4i > > S, 

4 4 


where in the last inequality we used the facts that (1 



We found that if an algorithm is $(\epsilon, \delta)$-correct under hypothesis $H_0$ and $E_0[T] \le t$, then, under
hypothesis $H_1$ this algorithm returns a sample that is smaller by at least $\epsilon$ than the maximal possible
reward with a probability of $\delta$ or more, hence the algorithm is not $(\epsilon, \delta)$-correct. Therefore, any
$(\epsilon, \delta)$-correct algorithm must satisfy $E_0[T] > t$. Hence the lower bound is obtained.


□ 


9 Appendix C 


Proof (Theorem 4). Since sampling from the unified arm consists of choosing one arm out
of the $|K|$ arms (with equal probability), and then sampling from this arm, it follows that
$F(\mu^* - \epsilon) \le 1 - \frac{A\epsilon^\beta}{|K|}$. Also, we note that $(1 - \epsilon)^{1/\epsilon} \le e^{-1}$ for every $\epsilon \in (0, 1]$. Therefore,
for $N = \lceil \frac{-\ln(\delta)|K|}{A\epsilon^\beta} \rceil + 1$,

$P(V_N < \mu^* - \epsilon) = (F(\mu^* - \epsilon))^N \le \left(1 - \frac{A\epsilon^\beta}{|K|}\right)^N \le \delta$, (13)

where $V_N$ is the largest reward observed among the first $N$ samples. Hence, the algorithm is $(\epsilon, \delta)$-correct.
The bound on the sample complexity is immediate from the definition of the algorithm.


□ 